{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "# Deep Q-Learning for Atari Breakout\n",
    "\n",
    "**Author:** [Jacob Chapman](https://twitter.com/jacoblchapman)<br>\n",
    "**Date created:** 2020/05/23<br>\n",
    "**Last modified:** 2020/05/23<br>\n",
    "**Description:** Play Atari Breakout with a Deep Q-Network."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Introduction\n",
    "\n",
    "This script shows an implementation of Deep Q-Learning on the\n",
    "`BreakoutNoFrameskip-v4` environment.\n",
    "\n",
    "### Deep Q-Learning\n",
    "\n",
    "As an agent takes actions and moves through an environment, it learns to map\n",
    "the observed state of the environment to an action. An agent will choose an action\n",
    "in a given state based on a \"Q-value\", which is a weighted reward based on the\n",
    "expected highest long-term reward. A Q-Learning Agent learns to perform its\n",
    "task such that the recommended action maximizes the potential future rewards.\n",
    "This method is considered an \"Off-Policy\" method,\n",
    "meaning its Q values are updated assuming that the best action was chosen, even\n",
    "if the best action was not chosen.\n",
    "\n",
    "### Atari Breakout\n",
    "\n",
    "In this environment, a board moves along the bottom of the screen returning a ball that\n",
    "will destroy blocks at the top of the screen.\n",
    "The aim of the game is to remove all blocks and breakout of the\n",
    "level. The agent must learn to control the board by moving left and right, returning the\n",
    "ball and removing all the blocks without the ball passing the board.\n",
    "\n",
    "### Note\n",
    "\n",
    "The Deepmind paper trained for \"a total of 50 million frames (that is, around 38 days of\n",
    "game experience in total)\". However this script will give good results at around 10\n",
    "million frames which are processed in less than 24 hours on a modern machine.\n",
    "\n",
    "### References\n",
    "\n",
    "- [Q-Learning](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf)\n",
    "- [Deep Q-Learning](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "from baselines.common.atari_wrappers import make_atari, wrap_deepmind\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from tensorflow import keras\n",
    "from tensorflow.keras import layers\n",
    "\n",
    "# Configuration paramaters for the whole setup\n",
    "seed = 42\n",
    "gamma = 0.99  # Discount factor for past rewards\n",
    "epsilon = 1.0  # Epsilon greedy parameter\n",
    "epsilon_min = 0.1  # Minimum epsilon greedy parameter\n",
    "epsilon_max = 1.0  # Maximum epsilon greedy parameter\n",
    "epsilon_interval = (\n",
    "    epsilon_max - epsilon_min\n",
    ")  # Rate at which to reduce chance of random action being taken\n",
    "batch_size = 32  # Size of batch taken from replay buffer\n",
    "max_steps_per_episode = 10000\n",
    "\n",
    "# Use the Baseline Atari environment because of Deepmind helper functions\n",
    "env = make_atari(\"BreakoutNoFrameskip-v4\")\n",
    "# Warp the frames, grey scale, stake four frame and scale to smaller ratio\n",
    "env = wrap_deepmind(env, frame_stack=True, scale=True)\n",
    "env.seed(seed)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Implement the Deep Q-Network\n",
    "\n",
    "This network learns an approximation of the Q-table, which is a mapping between\n",
    "the states and actions that an agent will take. For every state we'll have four\n",
    "actions, that can be taken. The environment provides the state, and the action\n",
    "is chosen by selecting the larger of the four Q-values predicted in the output layer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "num_actions = 4\n",
    "\n",
    "# Network defined by the Deepmind paper\n",
    "inputs = layers.Input(shape=(84, 84, 4,))\n",
    "\n",
    "# Convolutions on the frames on the screen\n",
    "layer1 = layers.Conv2D(32, 8, strides=4, activation=\"relu\")(inputs)\n",
    "layer2 = layers.Conv2D(64, 4, strides=2, activation=\"relu\")(layer1)\n",
    "layer3 = layers.Conv2D(64, 3, strides=1, activation=\"relu\")(layer2)\n",
    "\n",
    "layer4 = layers.Flatten()(layer3)\n",
    "\n",
    "layer5 = layers.Dense(512, activation=\"relu\")(layer4)\n",
    "action = layers.Dense(num_actions, activation=\"linear\")(layer5)\n",
    "\n",
    "# The first model makes the predictions for Q-values which are used to\n",
    "# make a action.\n",
    "model = keras.Model(inputs=inputs, outputs=action)\n",
    "# Build a target model for the prediction of future rewards.\n",
    "# The weights of a target model get updated every 10000 steps thus when the\n",
    "# loss between the Q-values is calculated the target Q-value is stable.\n",
    "model_target = keras.Model(inputs=inputs, outputs=action)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Train\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# In the Deepmind paper they use RMSProp however then Adam optimizer\n",
    "# improves training time\n",
    "optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)\n",
    "\n",
    "# Experience replay buffers\n",
    "action_history = []\n",
    "state_history = []\n",
    "state_next_history = []\n",
    "rewards_history = []\n",
    "done_history = []\n",
    "episode_reward_history = []\n",
    "running_reward = 0\n",
    "episode_count = 0\n",
    "frame_count = 0\n",
    "# Number of frames to take random action and observe output\n",
    "epsilon_random_frames = 50000\n",
    "# Number of frames for exploration\n",
    "epsilon_greedy_frames = 1000000.0\n",
    "# Maximum replay length\n",
    "# Note: The Deepmind paper suggests 1000000 however this causes memory issues\n",
    "max_memory_length = 100000\n",
    "# Train the model after 4 actions\n",
    "update_after_actions = 4\n",
    "# How often to update the target network\n",
    "update_target_network = 10000\n",
    "\n",
    "while True:  # Run until solved\n",
    "    state = np.array(env.reset())\n",
    "    episode_reward = 0\n",
    "\n",
    "    for timestep in range(1, max_steps_per_episode):\n",
    "        # env.render(); Adding this line would show the attempts\n",
    "        # of the agent in a pop up window.\n",
    "        frame_count += 1\n",
    "\n",
    "        # Use epsilon-greedy for exploration\n",
    "        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:\n",
    "            # Take random action\n",
    "            action = np.random.choice(num_actions)\n",
    "        else:\n",
    "            # Predict action Q-values\n",
    "            # From environment state\n",
    "            state_tensor = tf.convert_to_tensor(state)\n",
    "            state_tensor = tf.expand_dims(state_tensor, 0)\n",
    "            action_probs = model(state_tensor, training=False)\n",
    "            # Take best action\n",
    "            action = tf.argmax(action_probs[0]).numpy()\n",
    "\n",
    "        # Decay probability of taking random action\n",
    "        epsilon -= epsilon_interval / epsilon_greedy_frames\n",
    "        epsilon = max(epsilon, epsilon_min)\n",
    "\n",
    "        # Apply the sampled action in our environment\n",
    "        state_next, reward, done, _ = env.step(action)\n",
    "        state_next = np.array(state_next)\n",
    "\n",
    "        episode_reward += reward\n",
    "\n",
    "        # Save actions and states in replay buffer\n",
    "        action_history.append(action)\n",
    "        state_history.append(state)\n",
    "        state_next_history.append(state_next)\n",
    "        done_history.append(done)\n",
    "        rewards_history.append(reward)\n",
    "        state = state_next\n",
    "\n",
    "        # Update every fourth frame and once batch size is over 32\n",
    "        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:\n",
    "\n",
    "            # Get indices of samples for replay buffers\n",
    "            indices = np.random.choice(range(len(done_history)), size=batch_size)\n",
    "\n",
    "            # Using list comprehension to sample from replay buffer\n",
    "            state_sample = np.array([state_history[i] for i in indices])\n",
    "            state_next_sample = np.array([state_next_history[i] for i in indices])\n",
    "            rewards_sample = [rewards_history[i] for i in indices]\n",
    "            action_sample = [action_history[i] for i in indices]\n",
    "            done_sample = tf.convert_to_tensor(\n",
    "                [float(done_history[i]) for i in indices]\n",
    "            )\n",
    "\n",
    "            # Build the updated Q-values for the sampled future states\n",
    "            # Use the target model for stability\n",
    "            future_rewards = model_target.predict(state_next_sample)\n",
    "            # Q value = reward + discount factor * expected future reward\n",
    "            updated_q_values = rewards_sample + gamma * tf.reduce_max(\n",
    "                future_rewards, axis=1\n",
    "            )\n",
    "\n",
    "            # If final frame set the last value to -1\n",
    "            updated_q_values = updated_q_values * (1 - done_sample) - done_sample\n",
    "\n",
    "            # Create a mask so we only calculate loss on the updated Q-values\n",
    "            masks = tf.one_hot(action_sample, num_actions)\n",
    "\n",
    "            with tf.GradientTape() as tape:\n",
    "                # Train the model on the states and updated Q-values\n",
    "                q_values = model(state_sample)\n",
    "\n",
    "                # Apply the masks to the Q-values to get the Q-value for action taken\n",
    "                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)\n",
    "                # Calculate loss between new Q-value and old Q-value\n",
    "                # Clip the deltas using huber loss for stability\n",
    "                loss = tf.reduce_mean(keras.losses.huber(updated_q_values, q_action))\n",
    "\n",
    "            # Backpropagation\n",
    "            grads = tape.gradient(loss, model.trainable_variables)\n",
    "            optimizer.apply_gradients(zip(grads, model.trainable_variables))\n",
    "\n",
    "        if frame_count % 10000 == 0:\n",
    "            # update the the target network with new weights\n",
    "            model_target.set_weights(model.get_weights())\n",
    "            # Log details\n",
    "            template = \"running reward: {:.2f} at episode {}, frame count {}\"\n",
    "            print(template.format(running_reward, episode_count, frame_count))\n",
    "\n",
    "        # Limit the state and reward history\n",
    "        if len(rewards_history) > max_memory_length:\n",
    "            del rewards_history[:1]\n",
    "            del state_history[:1]\n",
    "            del state_next_history[:1]\n",
    "            del action_history[:1]\n",
    "            del done_history[:1]\n",
    "\n",
    "        if done:\n",
    "            break\n",
    "\n",
    "    # Update running reward to check condition for solving\n",
    "    episode_reward_history.append(episode_reward)\n",
    "    if len(episode_reward_history) > 100:\n",
    "        del episode_reward_history[:1]\n",
    "    running_reward = np.mean(episode_reward_history)\n",
    "\n",
    "    episode_count += 1\n",
    "\n",
    "    if running_reward > 40:  # Condition to consider the task solved\n",
    "        print(\"Solved at episode {}!\".format(episode_count))\n",
    "        break\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Visualizations\n",
    "In early stages of training:\n",
    "![Imgur](https://i.imgur.com/X8ghdpL.gif)\n",
    "\n",
    "In later stages of training:\n",
    "![Imgur](https://i.imgur.com/Z1K6qBQ.gif)\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "deep_q_network_breakout",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
